WorkServer White Paper

GNP Computers High Availability White Paper


If you would like a hardcopy of this white paper, complete with clear versions of the technical drawings, please send email to that effect to webmaster@gnp.com.


Table of Contents

1. INTRODUCTION
2. OVERVIEW OF THE SYSTEM

2.1 Design Objectives

2.2 System Components

3. COMPONENT CONNECTIVITY

3.1 Application Network

3.2 Maintenance Network

4. SOFTWARE ARCHITECTURE

4.1 Normal Operating Conditions

4.2 Software Layers

4.3 HA and Application Software

4.4 CPU State Transitions

5. COMPONENT FAILURE AND REPLACEMENT

5.1 Disk Failure

5.2 Disk Replacement

5.3 SCSI Switch or RAID Controller Failure

5.4 SCSI Switch or RAID Controller Replacement

5.5 CPU or Boot Disk Failure

5.6 CPU Replacement

5.7 Fan Failure and Replacement

5.8 Software Failure

5.9 Software Replacement and Upgrades

6. CONCLUSION

Appendix A Solstice DiskSuite

Appendix B Redundant Arrays of Independent Disks (RAID)


1. Introduction

[ IMAGE ]

This document describes the hypothetical system called the WorkServer Reference Platform which is illustrated above. This system could be implemented using existing technology developed by GNP Computers for its Telco WorkServer™ product line, a set of modular components for implementing high-availability computer platforms for the telecommunications industry. The system described here, an Internet mail server, is not a product that GNP Computers currently sells, nor is it one that GNP has implemented for any of its customers in precisely the form discussed. Nevertheless, it is representative of the type of platform that can be, and has been, implemented using WorkServer technology together with other off-the-shelf third-party products. The following discussion is designed to present some of the issues that are faced in constructing a high-availability computer platform, and to illustrate how the WorkServer products can be used to address them.

2. Overview of the System

2.1 Design Objectives

The primary goal of the system is to ensure that it reliably provides all of the computing resources needed to support the execution of application software - in the example discussed here, Internet mail server software - which provides a useful service, or set of services, to the system's clients. In the best of all possible worlds, components would never fail, and the system, once configured, would continue to provide the needed resources forever. Unfortunately, components do fail. Ultimately, the responsibility for dealing with failures falls to a human being - for example, to ensure that failed components are repaired or replaced. In the environments for which this system was designed, however, the strategy of relying on a person to detect failures and reconfigure the system to restore lost computing resources is not acceptable, either because a person cannot be expected to take the required action quickly enough to avoid an unacceptable interruption of service, or because it is too expensive to provide the continuous level of staffing needed to respond effectively to such unplanned events.

For this reason, the system is designed to act as a stable platform for application software in the face of component failures, without human intervention. Instead of being equipped with only one set of the necessary components, the system is also equipped with one or more spares for each of these components. For some components, the system uses the spare to completely mask a component failure - that is, make the failure transparent to application software; in other cases, the system can detect a failure and reconfigure itself to replace the failed component with its spare, with minimal effect on the application, and without human intervention.

2.2 System Components

In the specific example discussed here, the system is an Internet mail server platform: it supports the execution of software that implements the SMTP and POP3 protocols for routing and storing Internet messages, and for transferring them to the system's clients, mail users. The computing resources needed to support this application are relatively simple:

[ IMAGE ]

Figure 1 -- Application I/O and Maintenance I/O Topologies

In addition to these components required to support the system's primary function, mail service, other components are needed to support maintenance and control of the system:

3. Component Connectivity

Figure 1 contains two high-level views of the main system components and their connections to each other and to the outside world. The two views relate, respectively, to the two independent communication networks in the system:

  1. The application network, based on the standard Ethernet and SCSI protocols, is used to transfer application data between the currently active CPU and its peripherals and the outside world. It also supports the exchange of status information between two pairs of high-availability (HA) daemons running on the CPUs. Please refer to Section 4.3 for further details of the HA software architecture.

  2. The maintenance network, based on GNP proprietary hardware and software, is used to transport status, alarm and control information between all of the system modules and one or more PSGs.

The PSG is a device that acts as a gateway between the maintenance-network protocol and a standard asynchronous terminal interface. On its terminal interface, the PSG supports a command set that allows an external system to query any system module for maintenance status, to receive module alarm reports, and to change the maintenance state of any module, for example, to power it on or off.

3.1 Application Network

As shown in the upper left of Figure 1, an external Ethernet segment acts as the medium over which the system provides mail service to its clients. The system's two CPU modules are connected to this segment via one of their 100Base-T Ethernet interfaces. The CPUs also have direct Ethernet connectivity to each other via their two remaining Ethernet interfaces, which are used to exchange status information between the HA daemons running on the two CPUs.

Each CPU has access, through a pair of SCSI switches, to a pair of RAID controllers and their associated RAID sets. The SCSI switch has two Fast/Wide SCSI "CPU" interfaces, A and B, and two Fast/Wide SCSI "RAID" interfaces, 1 and 2. Each of the two CPU interfaces on each SCSI switch is connected to one of the Fast/Wide SCSI interfaces on one of the CPUs, and one of the RAID interfaces on the SCSI switch is connected to the Fast/Wide SCSI "host" interface on one of the RAID controllers. The other RAID interface on the SCSI switch is connected to a SCSI terminator, and is not shown in the diagram to avoid clutter.

The SCSI switch has two possible states. In state A1B2, as the name suggests, it connects host interface A to RAID interface 1 and host interface B to RAID interface 2. In Figure 1, with both switches in state A1B2, the effect is to connect the upper CPU to both RAID controllers, and to terminate the SCSI chains on the lower CPU. In the other possible state of the SCSI switches, A2B1, the effect is reversed, that is, the upper CPU has both SCSI chains terminated, and the lower CPU is connected to both RAID controllers. The state of a SCSI switch can be changed by manipulating its front-panel switches, or by sending a maintenance command from a PSG to the SCSI switch over the maintenance bus.
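
The switch's two states amount to a fixed mapping from CPU interfaces to RAID interfaces. The sketch below models that mapping; the state names A1B2 and A2B1 come from the text, but the dictionary representation and function name are illustrative, not part of the actual switch firmware.

```python
# Hypothetical model of the SCSI switch's two states.
SWITCH_STATES = {
    "A1B2": {"A": 1, "B": 2},  # CPU interface A -> RAID interface 1, B -> 2
    "A2B1": {"A": 2, "B": 1},  # CPU interface A -> RAID interface 2, B -> 1
}

def connected_raid_interface(state, cpu_interface):
    """Return the RAID interface reached from a CPU interface in a state."""
    return SWITCH_STATES[state][cpu_interface]
```

With both switches in A1B2, the CPU wired to interface A of each switch reaches both RAID controllers; flipping both switches to A2B1 reverses the roles.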

Each RAID controller is connected to two chains of SCSI disks. The diagram shows twelve disks, housed in four disk carrier modules, with each RAID controller connected to five disks, and two of the disks connected to the boot SCSI chains of the CPUs to act as Solaris boot devices. By adding additional disk carrier modules, the complement of each RAID set can be increased to as many as fourteen standard SCSI disks per controller. The controller supports standard RAID levels 0, 1, 4 and 5. Please refer to Appendix B for further details of RAID technology.

3.2 Maintenance Network

The lower half of Figure 1 shows how the application modules described in the previous section are connected to the maintenance network. In fact, all GNP modules are connected to the maintenance network (including, for example, the system's fans) and most of the features described here also apply to the other modules, but we will concentrate on the modules needed directly to support the application.

Physically, the maintenance network is a free-topology CSMA/CD LAN operating at 78 kbps, that runs through and between the system's midplanes and fan assembly. The fan assembly and the modules described in the previous sections, which are all housed in the system's sub-racks, each contain a dedicated maintenance processor, which is connected to the maintenance network through a set of pins in the midplane. The midplane also supplies electrical power to the maintenance section of each module: this supply, which is provided by the system's Maintenance Modules, is independent of and electrically isolated from the -48 vdc supply used to power the other application sections of each module. Maintenance power is always on, so the maintenance section of a module will begin to operate as soon as it is plugged into the midplane, whether or not the rest of the module - for example, the CPU's motherboard - is also powered on.

The maintenance processor in each module controls the power converters that supply the module's application section: it can turn the converters on and off, and trim the supply voltages. It also monitors the supply voltages and currents, and can raise alarms or shut the module off when pre-set voltage, current or temperature thresholds are crossed. The maintenance processor also controls and responds to the module's front-panel lights and switches, translating user requests into the corresponding maintenance actions, and making alarm conditions visible at the front panel.
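
The threshold behavior just described can be summarized as a two-level check. This is an illustrative sketch only; the threshold levels and return values are assumptions, not the maintenance processor's real interface.

```python
def maintenance_action(value, warn_at, shut_off_at):
    """Two-level threshold check for a monitored quantity
    (voltage, current or temperature)."""
    if value >= shut_off_at:
        return "shut-off"   # module powered off to protect the hardware
    if value >= warn_at:
        return "alarm"      # alarm raised on the maintenance network
    return "ok"
```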

All of the maintenance-processor functions described above can be monitored and controlled via messages exchanged between the maintenance processor and a PSG, using the maintenance network. The maintenance-network transport protocol provides an acknowledged datagram service that is used to transfer maintenance data - control and status information - between modules and PSGs. The PSGs, in turn, convert between this protocol and a command-line interface on their asynchronous serial ports.
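
An acknowledged datagram service of this kind is typically built on a send-and-retry loop. The sketch below illustrates the idea only; the retry count, the `AckTimeout` name, and the transport stand-in are assumptions, as the white paper does not specify the protocol internals.

```python
class AckTimeout(Exception):
    """No acknowledgement arrived within the retry budget (name assumed)."""

def send_reliably(transmit_once, max_retries=3):
    """Acknowledged-datagram send loop: retransmit until the peer
    acknowledges.  `transmit_once` is a stand-in for the real transport
    and returns True on acknowledgement, False on timeout."""
    for attempt in range(1, max_retries + 1):
        if transmit_once():
            return attempt  # number of transmissions that were needed
    raise AckTimeout("no acknowledgement after %d attempts" % max_retries)
```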

Some modules have other maintenance-network-based features in addition to these basic functions:

4. Software Architecture

Figure 2 contains high-level views of the primary system software components, their relationships to each other, and their states in a normally operating system. The figure is divided into left and right halves representing, respectively, the roles adopted by the system's two CPUs. The view corresponding to each CPU is divided again into two halves: the lower half illustrates the relationships between the major system software layers, and the upper half is a more detailed view of the processes that comprise the highest layer - the Veritas FirstWatch HA software and the application software.

4.1 Normal Operating Conditions

Normally, only one of the CPUs will be providing mail service to the system's clients. This active CPU will configure its public Ethernet interface with the "official" IP address of the mail server, and will configure the SCSI switches (using maintenance commands issued to the PSG connected to its terminal port B) to connect this CPU to both RAID controllers.

The other CPU, the standby, does not participate in providing mail service, but monitors the state of the active CPU using the redundant Ethernet HA "heartbeat" links, and is prepared to take over the role of the active CPU if that CPU should fail or voluntarily relinquish its active role. The monitoring of the CPUs by each other, and their transitions between the service-providing and non-service-providing states, are controlled by Veritas FirstWatch software running on both CPUs.
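
The heartbeat-based monitoring, and the asymmetric responses of the two CPUs described later in Section 5.5, can be sketched as follows. The function names and role encoding are illustrative, not FirstWatch's actual API.

```python
def mate_has_failed(last_heartbeat, now, timeout):
    """True when no heartbeat has arrived within `timeout` seconds."""
    return (now - last_heartbeat) > timeout

def choose_action(role, mate_failed):
    """Asymmetric response: a standby CPU whose active mate fails
    triggers a failover; an active CPU whose standby mate fails
    only logs the event."""
    if not mate_failed:
        return "none"
    return "failover" if role == "standby" else "log-only"
```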

[ IMAGE ]

Figure 2 -- System Software Components

4.2 Software Layers

The system's application and high-availability functionality both rely on a set of services provided by lower layers of software. From the lowest layer up, the software layers are:

4.3 HA and Application Software

The FirstWatch environment, illustrated in the upper half of Figure 2, consists of a set of user-level processes and scripts that run on the active and standby CPUs. This software is responsible for ensuring that the application services which the system is supposed to provide to the outside world, and the services that the application software itself requires in order to provide those services, are available at all times. If the software detects a service failure, it notifies maintenance personnel via console messages and entries in a system log, and it executes scripts which are designed to recover from the failure. It contains these components:

If an agent detects a loss of the service it is monitoring, it reports this fact to the HA daemons for logging, and performs the first-stage recovery action. Normally, the first-stage action is to execute a script which attempts to stop and restart the software component which is responsible for providing the service. For example, the agent monitoring SMTP mail service will attempt to stop and restart the sendmail daemon if it detects a loss of service. If the agent is not able to restore the service after a certain (configurable) number of attempts, it reports a service failure to the HA daemons. At this point, the HA daemons normally resort to more drastic recovery actions, such as causing a CPU failover.
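
The two-stage recovery policy described above (restart first, escalate on repeated failure) can be sketched as follows. Here `restart` stands in for an agent script, such as one that stops and restarts the sendmail daemon; the names and return values are illustrative.

```python
def recover_service(restart, max_attempts=3):
    """First-stage recovery: try to stop and restart the failing
    component up to `max_attempts` times; if the service does not
    come back, report a service failure to the HA daemons.
    `restart` returns True when the service is restored."""
    for attempt in range(1, max_attempts + 1):
        if restart():
            return ("restored", attempt)
    return ("report-failure-to-ha-daemons", max_attempts)
```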

The scripts used by each agent to test for availability of its service, the scripts used to restart a software component after a service has failed, and the scripts used to perform CPU state transitions are application-dependent and normally not provided directly as part of the FirstWatch package. Modification of existing prototype scripts, or development of new scripts, is normally required on the part of GNP Computers or the end customer.

4.4 CPU State Transitions

A CPU failover is the process of interchanging the states of the currently active and standby CPUs. This involves shutting down the application software on the current active CPU and de-commissioning its I/O interfaces, and moving all application functions to the current standby CPU, making it the new active CPU. The maintenance actions required to execute a failover are performed by scripts run by the HA daemons on each CPU. For the mail server application, there are two such scripts:
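
The ordering implied by the description above (take down the old active CPU's services before bringing them up on the new one) can be sketched as an ordered list of actions. The step names are descriptive paraphrases, not actual FirstWatch script names.

```python
def failover_steps():
    """Ordered failover actions implied by the text."""
    take_down = [
        "stop application software on the old active CPU",
        "release the official IP address",
        "switch the SCSI switches away from the old active CPU",
    ]
    take_over = [
        "switch the SCSI switches toward the new active CPU",
        "configure the official IP address",
        "start application software on the new active CPU",
    ]
    return take_down + take_over
```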

5. Component Failure and Replacement

With a complete description of the system's hardware and software components in hand, we are ready to address the main issue raised at the beginning of this document: how the system makes use of its components to provide a high-availability platform for the application. There are two aspects to consider in tackling this question. First, we must show how the system responds in the immediate aftermath of a component failure to reduce or eliminate any effect on the application. Secondly - and this is often overlooked in basic discussions of the subject - we must also address the steps that maintenance personnel must take to replace the failed component, and show that they too can be performed without interrupting service. Without this second capability, increased system downtime and increased vulnerability to catastrophic dual failures are inevitable.

5.1 Disk Failure

When one of the disks in a RAID set fails, the RAID controller may or may not be able to completely mask the failure, depending on the RAID level at which the set is configured:

Clearly, if the RAID controller is able to mask a disk failure, no other immediate recovery action is required. However, if the RAID controller is not able to mask the failure, either because it is configured at RAID level 0, or because it has sustained more than one disk failure, the CPU will no longer be able to communicate with that RAID set. At this point, the mirroring feature of Solstice DiskSuite comes into play: DiskSuite will take the affected sub-mirror devices off-line, but if the CPU can still access the other RAID set, reads and writes to the DiskSuite mirror device will proceed as normal. The DiskSuite monitoring agent will detect the failure of the sub-mirror devices and issue a warning message to the HA log.
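
The masking behavior of the DiskSuite mirror can be illustrated with a toy model: a read is satisfied by any on-line sub-mirror, and only the loss of every sub-mirror is unrecoverable. The data structures below are illustrative stand-ins, not DiskSuite's interface.

```python
def read_block(mirror, block):
    """Read from the first on-line sub-mirror.  `mirror` maps
    sub-mirror names to (online, blocks) pairs."""
    for online, blocks in mirror.values():
        if online:
            return blocks[block]
    raise IOError("all sub-mirrors off-line: unrecoverable dual failure")
```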

5.2 Disk Replacement

The failed disk can be removed and replaced without powering off any other components.1 After the drive has been replaced, it must be re-incorporated in the RAID set: if the RAID set itself failed as a result of the disk failure, the RAID controller will need to be restarted; otherwise, this operation can be performed while the RAID set is still on-line. After a failed RAID set has been restored, the corresponding DiskSuite sub-mirror devices will need to be re-synchronized with the active sub-mirrors, using DiskSuite maintenance commands executed on the active CPU.

1 - This feature is usually referred to as the hot-swap capability.

5.3 SCSI Switch or RAID Controller Failure

The failure of a SCSI switch or RAID controller is equivalent to the failure of a RAID set as described above, because the active CPU will no longer be able to access the affected RAID set. The response of the Solstice DiskSuite mirroring software will also be as described above, and will serve to mask the failure for application software.

5.4 SCSI Switch or RAID Controller Replacement

The SCSI Switch or RAID controller can be replaced without powering off any other components. After the failed component has been replaced, the affected DiskSuite sub-mirror devices will need to be re-synchronized with the active sub-mirrors, using DiskSuite maintenance commands executed on the active CPU.

5.5 CPU or Boot Disk Failure

Catastrophic failure of the active or standby CPU or their boot drives will be detected by the HA daemons running on the other CPU, when they stop receiving heartbeats from the mate CPU. If the standby CPU has failed, the HA daemons will immediately report this fact in the HA logs, but will take no further action. If the active CPU has failed, the standby CPU will cause a failover, after waiting for a configurable period of time for the active CPU to recover. As a result, the application software will resume execution on the new active CPU, and will continue to provide mail service to the outside world.

If failure of the active CPU or its boot drive is not catastrophic - for example, if a single SBus card fails - one or more of the service-monitoring agents will detect the failure, or more accurately, the effect of the failure on the service, and will report this failure to the HA daemons. Depending on the severity level associated with the agent, the HA daemons may respond to the report by causing a failover. All failures reported by agents are also recorded in the HA log.

5.6 CPU Replacement

A failed CPU module can be replaced without powering off any other components.2 In addition, because of the extra level of isolation provided by the SCSI switch modules, the failed CPU can be removed without affecting the integrity of the active CPU's connections to the RAID sets. This would not be the case, for example, if the CPUs and RAID controllers were connected to each other in the following way, as is sometimes suggested in other literature:

[ IMAGE ]

Figure 3 -- A Non-Optimal CPU-to-RAID Connection Scheme

With this scheme, if one of the CPUs is removed from the system for repair, both SCSI chains are no longer properly terminated. As a result, for the remaining CPU, all SCSI transactions on these chains are likely to fail, resulting in an unrecoverable dual failure of the DiskSuite sub-mirror devices, and complete loss of service. In contrast, because the SCSI switches isolate the two CPUs' connections to the RAID controllers, removing one CPU has no effect on the other. This is the primary motivation for introducing the SCSI switches into the reference architecture.

2 - The boot-disk replacement procedure is identical to the procedure for replacing a RAID-set disk described in Section 5.2.

5.7 Fan Failure and Replacement

The failure of any of the system's fans will be detected by the fan assembly's maintenance processor, and will result in a visible alarm on the fan assembly and an alarm notification on the maintenance network. The system can sustain failure of up to half of its fans without violating its published operating-environment specifications - that is, the system will not fail because of overheating. Each fan can be completely replaced without affecting any other system components, including the other fans.

5.8 Software Failure

The failure of system or application software is normally detected by agents running on the affected CPU, or by the HA daemons running on its mate if the failure is severe enough to cause the CPU to crash or hang. The agents or HA daemons will take the necessary recovery action, such as restarting the failed component or causing a CPU fail-over. Since most software failures in well-tested telecommunication systems are triggered by transient environmental conditions or improper configuration data, the recovery action normally has the desired effect of restoring the affected application service for an extended period.

5.9 Software Replacement and Upgrades

The ability to replace or upgrade software on a live system is one of the main advantages of a system with loosely-coupled redundant components, when compared with a tightly-coupled system in which hardware components operate in lock step. For example, in the system described here, it is possible to shut down the standby CPU, reload any or all software components using the CPU's Removable Media Module (CD-ROM or 4mm DAT), boot the CPU and execute component-level and selected application-level tests that verify the new software configuration, without interfering with the application running on the active CPU. Using the ability to cause a CPU fail-over from the FirstWatch maintenance console, it is also possible to soak the new software configuration under live traffic for a pre-determined interval, while retaining the ability to immediately fail-back to the earlier configuration if unexpected problems arise. After the new application configuration has been tested for as long as is necessary under load, the other CPU can be upgraded in the same manner as the first. The potential sticking point of compatibility between the old and new application data formats usually must be tackled anyway as part of the software upgrade plan.
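
The rolling-upgrade procedure described above can be summarized as an ordered sequence; the step wording below is a paraphrase of the text, not a formal procedure.

```python
def live_upgrade_steps():
    """The live software-upgrade sequence, as an ordered list."""
    return [
        "shut down the standby CPU",
        "reload software from the Removable Media Module",
        "boot the standby CPU and run verification tests",
        "fail over so the upgraded CPU carries live traffic",
        "soak under live traffic, ready to fail back if problems arise",
        "upgrade the other CPU in the same manner",
    ]
```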

6. Conclusion

We have described the construction and operation of a hypothetical reference system, consisting of GNP Telco WorkServer components together with off-the-shelf third-party products, that performs as a high-availability Internet mail server. We described the roles played by the system's industry-standard components and I/O interfaces in providing the capabilities needed to execute application software. We also described the role played by the WorkServer's Intelligent Maintenance Network in providing low-level monitoring and control access to all components, and how this capability can be used by HA software running on the system's CPUs, and by external systems connected to the PSGs, to manage the system. We showed how the HA software is able to monitor the services provided by lower-level system software and the application, and take the necessary recovery action if services fail, including the ability to migrate all system functions to the standby CPU. Finally, we reviewed the system's response to individual component failures, showing that it is possible for the system to mask or reconfigure around failures, and for maintenance personnel to take the necessary repair actions, without affecting the level of service provided by the mail server.

Appendix A Solstice DiskSuite

The SunSoft Solstice DiskSuite 4.0 software package provides a number of features that enhance the performance, reliability and manageability of sets of disks attached to a single computer, and simplify the use of sets of disks for storing application data. The central feature provided by DiskSuite is the ability for application software to treat a set of disk partitions as a single logical device, called a metadevice. DiskSuite software uses the following techniques to implement metadevices with better characteristics than the single-partition devices implemented by the base SunOS kernel:

Striping is similar to concatenation except that the addressing of the metadevice blocks is interlaced on the components, rather than addressed sequentially, in order to achieve higher performance. The interlace value for striping is user-definable, and can be tuned for specific read/write performance characteristics.
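
Interlaced addressing can be made concrete with a small mapping function: logical blocks are grouped into interlace-sized chunks, and successive chunks rotate across the component disks. This is a conceptual sketch, not DiskSuite's actual layout code (which operates on disk partitions).

```python
def stripe_location(block, n_disks, interlace):
    """Map a logical metadevice block to (disk, physical_block)
    under striping."""
    chunk = block // interlace            # which interlace-sized chunk
    offset = block % interlace            # position within that chunk
    disk = chunk % n_disks                # chunks rotate across the disks
    physical = (chunk // n_disks) * interlace + offset
    return disk, physical
```

With three disks and an interlace of two blocks, logical blocks 0-1 land on disk 0, blocks 2-3 on disk 1, blocks 4-5 on disk 2, and block 6 wraps back to disk 0.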

To set up mirroring, one creates a meta-mirror, which is a metadevice made up of one or more other metadevices, which are called sub-mirrors. Once a meta-mirror is defined, additional sub-mirrors can be added at a later date without bringing the system down or disrupting reads and writes to existing sub-mirrors. When a sub-mirror is attached, all the data from another sub-mirror in the meta-mirror is automatically written to the newly attached sub-mirror - this process is called resyncing.

A pseudo device, called the metatrans device, is responsible for managing the contents of the log of file system updates. Like other metadevices, the metatrans device behaves like an ordinary disk device. The metatrans device is made up of two sub-devices: the logging device and the master device. The logging device contains the log of file system updates, that is, a sequence of records, each describing a change to a file system. The master device contains an existing or a newly created UFS file system. The master device can contain an existing UFS file system because creating a metatrans device does not alter the master device. The difference in operation is that updates to the file system are written to the log before being "rolled forward" to the UFS file system. The master device is never left in an inconsistent state, and DiskSuite software can examine the transaction log after a system crash to recover file-system changes that were not committed to the master device.
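
The "roll forward" step can be illustrated with a minimal write-ahead-log model: after a crash, any logged updates not yet committed to the master are simply replayed. This is a conceptual sketch, not the metatrans device's on-disk format.

```python
def roll_forward(master, log, committed):
    """Replay logged updates not yet applied to the master device.
    `log` is a list of (key, value) records; `committed` counts how
    many of them the master already reflects."""
    for key, value in log[committed:]:
        master[key] = value
    return master
```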

Appendix B Redundant Arrays of Independent Disks (RAID)

RAID technology provides an elegant solution for reliable data storage. The basic idea of RAID is to use the combined storage capacity and I/O bandwidth of a set of disks, together with a microprocessor-based RAID controller which has I/O connectivity to all of the disks and to one or more host computers, to implement a storage device that has greater capacity, throughput and reliability than any of the component disks.
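
The redundancy used by RAID levels 4 and 5 is XOR parity: the parity block is the bitwise XOR of the data blocks, so any single lost block can be rebuilt by XOR-ing the parity with the surviving blocks. A minimal sketch:

```python
def parity(blocks):
    """Bitwise XOR over equal-length data blocks (the redundancy
    scheme used by RAID levels 4 and 5)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild a single lost block: XOR the parity with the survivors."""
    return parity(list(surviving_blocks) + [parity_block])
```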


For more information about the WorkServer, High Availability, or other GNP Computers products and news, please contact us or visit our Website at http://www.gnp.com.



Copyright © 1996 GNP Computers, Inc.